Abstract

For some images, descriptions written by multiple people
are consistent with each other. But for other images,
descriptions across people vary considerably. In other words,
some images are specific - they elicit consistent descriptions
from different people - while other images are ambiguous.
Applications involving images and text can benefit from an
understanding of which images are specific and which ones
are ambiguous. For instance, consider text-based image
retrieval. If a query description is moderately similar to the
caption (or reference description) of an ambiguous image,
that query may be considered a decent match to the image.
But if the image is very specific, a moderate similarity
between the query and the reference description may not be
sufficient to retrieve the image.

In this paper, we introduce the notion of image specificity.
We present two mechanisms to measure specificity
given multiple descriptions of an image: an automated
measure and a measure that relies on human judgement. We
analyze image specificity with respect to image content and
properties to better understand what makes an image specific.
We then train models to automatically predict the
specificity of an image from image features alone without
requiring textual descriptions of the image. Finally, we
show that modeling image specificity leads to improvements
in a text-based image retrieval application.

1. Introduction 

Consider the two photographs in Figure 1. How would
you describe them? For the first, phrases like “people lined
up in terminal”, “people lined up at train station”, “people
waiting for train outside a station”, etc. come to mind. It is
clear what to focus on and describe. In fact, different people
talk about similar aspects of the image - the train, people,
station or terminal, lining or queuing up. But for the
photograph on the right, it is less clear how it should be described.
Some people talk about the sunbeam shining through the
skylight, while others talk about the alleyway, or the people
selling products and walking. In other words, the
photograph on the left is specific whereas the photograph on the
right is ambiguous.

[Figure 1: two photographs, each shown with example descriptions]

Left: "people lined up in terminal"; "people lined up at train station"; "long line at a station"; "people waiting for train outside a station"

Right: "alleyway in a small town"; "People sitting and walking"; "man walking in shopping area with others selling products"; "sunbeam shining through skylight"

Figure 1. Some images are specific - they elicit consistent
descriptions from different people (left). Other images (right) are
ambiguous.


The computer vision community has made tremendous
progress on recognition problems such as object detection
[12, 16], image classification [26], attribute classification
[48] and scene recognition [50, 52]. Various approaches
are moving to higher-level semantic image understanding
tasks. One such task that is receiving increased
attention in recent years is that of automatically generating
textual descriptions of images [5, 8, 13, 14, 23, 25,
28, 31, 35, 37, 38, 47, 51] and evaluating these descriptions
[11, 32, 38, 39]. However, these works have largely
ignored the variance in descriptions produced by different
people describing each image. In fact, early works
that tackled the image description problem [14] or reasoned
about what image content is important and frequently
described [3] claimed that human descriptions are consistent.
We show that there is in fact variance in how consistent
multiple human-provided descriptions of the same image are.
Instead of treating this variance as noise, we think of it as
a useful signal that, if modeled, can benefit applications
involving images and text.

We introduce the notion of image specificity which measures
the amount of variance in multiple viable descriptions
of the same image. Modeling image specificity can
benefit a variety of applications. For example, computer-generated
image description and evaluation approaches can
benefit from specificity. If an image is known to be ambiguous,
several different descriptions can be generated and
be considered to be plausible. But if an image is specific, a
narrower range of descriptions may be appropriate.
Photographers, editors, graphics designers, etc. may want to pick
specific images - images that are likely to have a single
(intended) interpretation across viewers.

Given multiple human-generated descriptions of an image,
we measure specificity using two different mechanisms:
one requiring human judgement of similarities between
two descriptions, and the other using an automatic
textual similarity measure. Images with a high average
similarity between pairs of sentences describing the image are
considered to be specific, while those with a low average
similarity are considered to be ambiguous. We then analyze
the correlation between image specificity and image content
or properties to understand what makes certain images
more specific than others. We find that images with people
tend to be specific, while mundane images of generic buildings
or blue skies do not tend to be specific. We then train
models that can predict the specificity of an image just by
using image features (without associated human-generated
descriptions). Finally, we leverage image specificity to improve
performance in a real-world application: text-based
image retrieval.

2. Related work 

Image properties: Several works study high-level image
properties beyond those depicted in the image content itself.
For instance, unusual photographs were found to be
interesting [17] and images of indoor scenes with people were
found to be memorable, while scenic, outdoor scenes were
not [18, 19]. Other properties of images such as aesthetics
[ ], attractiveness [30], popularity [24], and visual clutter
[42] have also been studied*. In this paper, we study a
novel property of images - specificity - that captures the
degree to which multiple human-generated descriptions of
an image vary. We study what image content and properties
make images specific. We go a step further and leverage
this new property to improve a text-based image retrieval
application.

Importance: Some works have looked at what is worth
describing in an image. Bottom-up saliency models [20, 22]
study which image features predict eye fixations. Importance
[3, 44] characterizes the likelihood that an object in
an image will be mentioned in its description. Attribute
dominance [45] models have been used to predict which
attributes pop out and the order in which they are likely to be
named. However, unlike most of these works, we look at the
variance in human perception of what is worth mentioning
in an image and how it is mentioned.

*Our work is complementary to visual metamers [15]. In visual
metamers, different images are perceived similarly but in specificity, we
study how the same image can be perceived differently, and how this
variance in perception differs across images.


Image description: Several approaches have been proposed
for automatically describing images. This paper does
not address the task of generating descriptions. Instead, it
studies a property of how humans describe images - some
images elicit consistent descriptions from multiple people
while others do not. This property can benefit image
description approaches. Some image description approaches
are data-driven. They retrieve images from a database that
are similar to the input image, and leverage descriptions
associated with the retrieved images to describe the input
image [14, 38]. In such approaches, knowledge of the
specificity of the input image may help guide the range of the
search for visually similar images. If the input image is
specific, perhaps only highly similar images and their
associated descriptions should be used to construct its
description. Other approaches analyze the content of the image
and then compose descriptive sentences using knowledge
of sentence structures [28, 37]. For images that are ambiguous,
the model can predict multiple diverse high-scoring
descriptions of the image that can all be leveraged for a
downstream application. Finally, existing automatic image
description evaluation metrics such as METEOR [1],
ROUGE [32], BLEU [39] and CIDEr [46] compare a generated
description with human-provided reference descriptions
of the image. This evaluation protocol does not account
for the fact that some images have multiple viable
ways in which they can be described. Perhaps the penalty
for not matching reference descriptions of ambiguous images
should be less than for specific ones.

Image retrieval: Query- or text-based image and video
retrieval approaches evaluate how well a query matches the
content of [2, 10, 34, 43] or captions (descriptions) associated
with [2, 29] images in a database. However, the fact
that each image may have a different match score or similarity
that is sufficient to make it relevant to a query has not
been studied. In this work, we use image specificity to fill
this gap. While the role of lexical ambiguity in information
retrieval has been studied before [27], reasoning about
inherent ambiguity in images for retrieval tasks has not been
explored.

3. Approach 

We first describe the two ways in which we measure the
specificity of an image. We then describe how we use specificity
in a text-based image retrieval application.

3.1. Measuring Specificity 

We define the specificity of an image as the average similarity
between pairs of sentences describing the image. For
each image $i$, we are given a set $S^i$ of $N$ sentence
descriptions $\{s_1^i, \ldots, s_N^i\}$. We measure the similarity between all
possible pairs of sentences and average the scores. The
similarity between two sentences can either be judged by
humans or computed automatically.
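
The definition above can be sketched in a few lines of Python. This is only an illustration of "average similarity over all sentence pairs": the word-overlap similarity used here (`jaccard_similarity`) is a hypothetical stand-in chosen to keep the sketch dependency-free, not either of the two measures described in this section.

```python
from itertools import combinations


def jaccard_similarity(a, b):
    """Toy sentence similarity in [0, 1] based on word overlap.

    Assumption: a stand-in for illustration only, not the human or
    automated similarity used in the paper.
    """
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


def specificity(sentences, similarity=jaccard_similarity):
    """Average similarity over all unordered pairs of descriptions."""
    pairs = list(combinations(sentences, 2))
    if not pairs:
        raise ValueError("need at least two descriptions")
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


specific = ["a snowy mountain", "a snowy mountain", "a snowy mountain"]
ambiguous = ["people waiting at an airport", "a large bowling rink"]
print(specificity(specific), specificity(ambiguous))
```

Any similarity function that returns values in [0, 1] can be passed in place of the toy one, which is how the human and automated variants differ.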
[Figure 2: four example images, each shown with its human-annotated specificity score and its descriptions]

Specificity = 0.89
  There is a lot of snow on the mountain.
  There is a snow covered mountain.
  A snow covered mountain.
  A mountain with snow.
  A snowy mountain.

Specificity = 0.59
  Children play racing games in an arcade.
  A group of kids playing games.
  A few kids playing arcade games.
  some kids in an arcade.
  Kids are playing racing games.

Specificity = 0.37
  A house with a porch.
  There is a railing around the porch of the house.
  House with really green grass.
  A view of a small white and blue house.
  a house shown from outside.

Specificity = 0.11
  People waiting at an airport.
  The interior of a building with a sloped roof.
  the inside of airport.
  A decadent room with people walking around.
  A large bowling rink.

Figure 2. Example images with very low to very high human-annotated specificity scores.


3.1.1 Human Specificity Measurement 

$M$ different subjects on Amazon Mechanical Turk (AMT)
were asked to rate the similarity between a pair of sentences
$s_a^i$ and $s_b^i$ on a scale of 1 (very different) to 10 (very
similar). Note that subjects were not shown the corresponding
image and were not informed that the sentences describe the
same image. This ensured that subjects rated the similarity
between sentences based solely on their textual content. We
shift and scale the similarity scores to lie between 0 and 1.
We denote this similarity, as assessed by the $m$-th subject, to
be $\mathrm{sim}_{hum}^{m}(s_a^i, s_b^i)$.

The average similarity score across all pairs of sentences
and subjects gives us the specificity score $\mathrm{spec}_{hum}^{i}$ for
image $i$ based on human perception. For ease of notation, we
drop the superscript $i$ when it is clear from the context.

$$\mathrm{spec}_{hum} = \frac{1}{M \binom{N}{2}} \sum_{\forall \{s_a, s_b\} \subset S} \sum_{m=1}^{M} \mathrm{sim}_{hum}^{m}(s_a, s_b) \tag{1}$$

Figure 2 shows images with their human-annotated 
specificity scores. Note how the specificity score drops as 
the sentence descriptions become more varied. 
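
As a rough sketch of the human measurement, the snippet below rescales 1-to-10 ratings to [0, 1] and averages them over sentence pairs and subjects. The linear map `(rating - 1) / 9` is an assumption (the text only says the scores are shifted and scaled), and the function and variable names are hypothetical.

```python
def rescale(rating, low=1, high=10):
    """Map a rating on [low, high] linearly onto [0, 1] (assumed mapping)."""
    return (rating - low) / (high - low)


def human_specificity(pair_ratings):
    """Average rescaled rating over all sentence pairs and all subjects.

    pair_ratings maps each sentence pair, e.g. (0, 1), to the list of
    ratings that pair received. Every pair is assumed to have the same
    number of ratings, so averaging per pair and then across pairs equals
    the joint average over pairs and subjects.
    """
    per_pair = [
        sum(rescale(r) for r in ratings) / len(ratings)
        for ratings in pair_ratings.values()
    ]
    return sum(per_pair) / len(per_pair)


# Three sentences give three pairs; two hypothetical subjects rate each pair.
ratings = {(0, 1): [10, 10], (0, 2): [1, 1], (1, 2): [10, 1]}
print(human_specificity(ratings))
```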

3.1.2 Automated Specificity Measurement 

To measure specificity automatically given the $N$ descriptions
for image $i$, we first tokenize the sentences and only
retain words of length three or more. This ensured that
semantically irrelevant words, such as ‘a’, ‘of’, etc., were not
taken into account in the similarity computation (a standard
stop word list could also be used instead). We identified the
synsets (sets of synonyms that share a common meaning)
to which each (tokenized) word belongs using the Natural
Language Toolkit [4]. Words with multiple meanings can
belong to more than one synset. Let $Y_{au} = \{y_{au}\}$ be the set
of synsets associated with the $u$-th word from sentence $s_a$.
Every word in both sentences contributes to the automatically
computed similarity $\mathrm{sim}_{auto}(s_a, s_b)$ between a pair
of sentences $s_a$ and $s_b$. The contribution of the $u$-th word
from sentence $s_a$ to the similarity is $c_{au}$. This contribution
is computed as the maximum similarity between this word
and all words in sentence $s_b$ (indexed by $v$). The similarity
between two words is the maximum similarity between
all pairs of synsets (or senses) to which the two words have
been assigned. We take the maximum because a word is
usually used in only one of its senses. Concretely,

$$c_{au} = \max_{v} \max_{y_{au} \in Y_{au}} \max_{y_{bv} \in Y_{bv}} \mathrm{sim}_{sense}(y_{au}, y_{bv}) \tag{2}$$


The similarity between senses $\mathrm{sim}_{sense}(y_{au}, y_{bv})$ is the
shortest path similarity between the two senses on WordNet
[36]. We can similarly define $c_{bv}$ to be the contribution
of the $v$-th word from sentence $s_b$ to the similarity
$\mathrm{sim}_{auto}(s_a, s_b)$ between sentences $s_a$ and $s_b$.
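
The nested maximum for word contributions can be illustrated with a small, dependency-free sketch. The sense inventory and path lengths below are made up for illustration; with NLTK one would presumably obtain senses from `nltk.corpus.wordnet.synsets(word)` and compare them with `Synset.path_similarity`, but the toy tables here avoid needing the WordNet corpus.

```python
# Hypothetical toy sense inventory: word -> list of sense identifiers.
SENSES = {
    "bank": ["bank.finance", "bank.river"],
    "shore": ["shore.land"],
    "money": ["money.cash"],
}

# Hypothetical shortest-path lengths between senses in a toy taxonomy.
PATH_LENGTH = {
    frozenset(("bank.river", "shore.land")): 1,
    frozenset(("bank.finance", "money.cash")): 2,
}


def sense_similarity(y, z):
    """Path-style similarity 1 / (1 + distance); 0 if the senses are unrelated."""
    if y == z:
        return 1.0
    distance = PATH_LENGTH.get(frozenset((y, z)))
    return 0.0 if distance is None else 1.0 / (1.0 + distance)


def word_similarity(u, v):
    """Maximum similarity over all pairs of senses of the two words."""
    return max(
        (sense_similarity(y, z) for y in SENSES[u] for z in SENSES[v]),
        default=0.0,
    )


def contributions(words_a, words_b):
    """For each word of sentence a, its best match among words of sentence b."""
    return [
        max((word_similarity(u, v) for v in words_b), default=0.0)
        for u in words_a
    ]


print(contributions(["bank"], ["shore", "money"]))
print(contributions(["shore", "money"], ["bank"]))
```

Taking the maximum over senses means "bank" is matched to "shore" through its river sense, even though its finance sense is unrelated to "shore".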

The similarity between the two sentences is defined as
the average contribution of all words in both sentences,
weighted by the importance of each word. Let the importance
of the $u$-th word from sentence $s_a$ be $t_{au}$. This
importance is computed using term frequency-inverse document
frequency (TF-IDF) using the scikit-learn software
package [40]. Words that are rare in the corpus but occur
frequently in a sentence contribute more to the similarity of
that sentence with other sentences. So we have

$$\mathrm{sim}_{auto}(s_a, s_b) = \frac{\sum_{u} t_{au}\, c_{au} + \sum_{v} t_{bv}\, c_{bv}}{\sum_{u} t_{au} + \sum_{v} t_{bv}} \tag{3}$$


The denominator in Equation 3 ensures that the similarity
between two sentences is independent of sentence length
and is always between 0 and 1. Finally, the automated
specificity score $\mathrm{spec}_{auto}^{i}$ of an image $i$ is computed by averaging
these similarity scores across all sentence pairs:

$$\mathrm{spec}_{auto} = \frac{1}{\binom{N}{2}} \sum_{\forall \{s_a, s_b\} \subset S} \mathrm{sim}_{auto}(s_a, s_b) \tag{4}$$

The reader is directed to the supplementary material [21]
for a pictorial explanation of automated specificity
computation.
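
A minimal sketch of the importance-weighted sentence similarity follows, assuming the per-word contributions and TF-IDF weights have already been computed. The argument names and example numbers are hypothetical.

```python
def sentence_similarity(contrib_a, weight_a, contrib_b, weight_b):
    """Importance-weighted average of word contributions from both sentences.

    contrib_a[u] is the contribution of word u of sentence a and weight_a[u]
    its importance (e.g. a TF-IDF weight); likewise for sentence b. With
    contributions in [0, 1] and non-negative weights the result is in [0, 1].
    """
    numerator = sum(t * c for t, c in zip(weight_a, contrib_a))
    numerator += sum(t * c for t, c in zip(weight_b, contrib_b))
    denominator = sum(weight_a) + sum(weight_b)
    return numerator / denominator if denominator > 0 else 0.0


# Hypothetical numbers: sentence a has two words, sentence b has one.
print(sentence_similarity([1.0, 0.5], [2.0, 1.0], [1.0], [1.0]))
```

Dividing by the total weight is what keeps the score comparable between short and long sentences.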
3.2. Application: Text-based image retrieval

We now describe how we use image specificity in a text-based
image retrieval application.

3.2.1 Setup 

There is a particular image the user is looking for from a
database of images. We call this the target image. The user
inputs a query sentence $q$ that describes the target image.
Every image $i$ in the database is associated with a single
reference description $r_i$ (not to be confused with the “training”
pool of sentences $S^i$ described in Section 3.1 used to define
the specificity of an image). This can be, for example, the
caption in an online photo database such as Flickr. The goal
is to sort the images in the database according to their
relevance score $\mathrm{rel}^i$ from most to least relevant, such that the
target image has a low rank.

3.2.2 Baseline Approach 

The baseline approach automatically computes a similarity
$\mathrm{sim}_{auto}(q, r_i)$ between $q$ and $r_i$ using Equation 3. All
images in the database are sorted in descending order using
this similarity score. That is,

$$\mathrm{rel}^i_{baseline} = \mathrm{sim}_{auto}(q, r_i) \tag{5}$$

The image whose reference sentence has the highest similarity
to the query sentence gets ranked first while the image
whose reference sentence has the lowest similarity to
the query sentence gets ranked last.
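
The baseline ranking amounts to a descending sort by similarity. A minimal sketch, with hypothetical image identifiers and similarity values:

```python
def rank_images(scores):
    """Image ids sorted from most to least relevant (ties keep input order)."""
    return sorted(scores, key=lambda image_id: scores[image_id], reverse=True)


def target_rank(scores, target_id):
    """1-based position of the target image in the sorted list."""
    return rank_images(scores).index(target_id) + 1


# Hypothetical similarities between one query and three reference descriptions.
similarities = {"img_a": 0.2, "img_b": 0.9, "img_c": 0.5}
print(rank_images(similarities))
print(target_rank(similarities, "img_c"))
```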

3.2.3 Proposed Approach 

In the proposed approach, instead of ranking just by similarity
between the query sentence and reference descriptions
in the database, we take into consideration the specificity of
each image. The rationale is the following: a specific image
should be ranked high only if the query description matches
the reference description of that image well, because we
know that sentences that describe this image tend to be very
similar. For ambiguous images, on the other hand, even
mediocre similarities between query and reference descriptions
may be good enough.

This suggests that instead of just sorting based on
$\mathrm{sim}_{auto}(q, r_i)$, the similarity between the query description
$q$ and the reference description $r_i$ of an image $i$ (which is
what the baseline approach does as seen in Equation 5), we
should model $P(\mathrm{match} \mid \mathrm{sim}_{auto}(q, r_i))$, which captures the
probability that the query sentence matches the reference
sentence, i.e., the query sentence describes the image. We
use Logistic Regression (LR) to model this.

$$\mathrm{rel}^i_{gt\_specificity} = P(\mathrm{match} \mid \mathrm{sim}_{auto}(q, r_i)) = \frac{1}{1 + e^{-\beta_0^i - \beta_1^i\, \mathrm{sim}_{auto}(q, r_i)}} \tag{6}$$

For each image in the database, we train the above LR
model. Positive examples for this model are the similarity
scores between pairs of sentences both describing the image
$i$, taken from the set $S^i$ described in Section 3.1. Negative
examples are similarity scores between pairs of sentences
where one sentence describes the image $i$ but the other does
not. If there are $N$ descriptions available for each image
during training, we have $\binom{N}{2}$ positive examples (all pairs
of $N$ sentences). We generate a similar number of negative
examples by pairing each of the $N$ descriptions with $\lceil (N-1)/2 \rceil$
descriptions from other images. $\lceil \cdot \rceil$ is the ceiling function.
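
The counting argument above can be checked with a short sketch that builds positive and negative sentence pairs for one image. How the sentences from other images are chosen is not specified in the text; sampling them uniformly at random without replacement is an assumption made here, and all names are hypothetical.

```python
import random
from itertools import combinations
from math import ceil, comb


def build_training_pairs(own, others, rng):
    """Positive and negative sentence pairs for one image's LR model.

    own:    the N descriptions of this image.
    others: a pool of descriptions of other images (assumed larger than
            ceil((N - 1) / 2) so that sampling without replacement works).
    """
    n = len(own)
    positives = list(combinations(own, 2))        # N choose 2 pairs
    per_sentence = ceil((n - 1) / 2)              # negatives per own sentence
    negatives = [
        (sentence, other)
        for sentence in own
        for other in rng.sample(others, per_sentence)
    ]
    return positives, negatives


own = [f"own description {k}" for k in range(5)]
others = [f"other description {k}" for k in range(20)]
positives, negatives = build_training_pairs(own, others, random.Random(0))
print(len(positives), len(negatives))  # balanced for N = 5
print(comb(48, 2), 48 * ceil(47 / 2))  # roughly balanced for N = 48
```

For odd $N$ the two counts coincide exactly; for even $N$ the ceiling makes the negatives slightly more numerous, which matches "a similar number".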

The parameters of this LR model, $\beta_0^i$ and $\beta_1^i$, inherently
capture the specificity of the image. Note that a separate
LR model is trained for each image to model the specificity
for that image. After these models have been trained, given
a new query description $q$, the similarity $\mathrm{sim}_{auto}(q, r_i)$ is
computed with every reference description in the dataset.
The trained LR for each image, i.e. the parameters $\beta_0^i$ and
$\beta_1^i$, can be used to compute $P(\mathrm{match} \mid \mathrm{sim}_{auto}(q, r_i))$ for
that image. All images can then be sorted by their corresponding
$P(\mathrm{match} \mid \mathrm{sim}_{auto}(q, r_i))$ values. In our experiments,
unless mentioned otherwise, the query and reference
descriptions being used at test time were not part of
the training set used to train the LRs.
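
To illustrate how per-image parameters can change the ranking, the sketch below scores two images with hypothetical, hand-picked logistic parameters (not values fitted in the paper). In practice the parameters would be fitted per image, for instance with an off-the-shelf logistic regression implementation; that choice is left open here.

```python
from math import exp


def p_match(similarity, beta0, beta1):
    """Logistic model of P(match | similarity) for one image."""
    return 1.0 / (1.0 + exp(-beta0 - beta1 * similarity))


def rank_by_specificity(similarities, params):
    """Sort image ids by their per-image match probability, highest first."""
    scores = {
        image_id: p_match(similarities[image_id], *params[image_id])
        for image_id in similarities
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical query similarities and per-image (beta0, beta1) values.
similarities = {"specific": 0.6, "ambiguous": 0.5}
params = {
    "specific": (-8.0, 10.0),   # needs a high similarity to count as a match
    "ambiguous": (-3.0, 10.0),  # a moderate similarity already suffices
}
print(rank_by_specificity(similarities, params))
```

Here the ambiguous image is ranked first even though its raw similarity is lower, which is the behaviour the rationale in this section argues for.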

3.2.4 Predicting Specificity of Images 

The above approach needs several sentences per image to
obtain positive and negative examples to train the LR. But
in realistic scenarios, it may not be viable to collect multiple
sentences for every image in the database. Hence, we
learn a mapping from the image features $\{x_i\}$ to the LR
parameters estimated using sentence pairs. We call these
parameters ground-truth LR parameters. We train two separate
$\nu$-Support Vector Regression (SVR) models (one for
each $\beta$ term) with Radial Basis Function (RBF) kernel.
The learnt SVR models are then used to predict the LR
parameters $\hat{\beta}_0^i$ and $\hat{\beta}_1^i$ of any previously unseen image.
Finally, these predicted LR parameters are used to compute
$P(\mathrm{match} \mid \mathrm{sim}_{auto}(q, r_i))$ and sort images in a database
according to their relevance to a query description $q$. Of
course, each image in the database still needs a (single)
reference description $r_i$ (without which text-based image
retrieval is not feasible).

$$\mathrm{rel}^i_{pred\_specificity} = P(\mathrm{match} \mid \mathrm{sim}_{auto}(q, r_i)) = \frac{1}{1 + e^{-\hat{\beta}_0^i - \hat{\beta}_1^i\, \mathrm{sim}_{auto}(q, r_i)}} \tag{7}$$

Notice that the baseline approach (Equation 5) is a special
case of our proposed approaches (Equations 6 and 7)
with $\beta_0^i = \hat{\beta}_0^i = \mathrm{constant}_0$ and $\beta_1^i = \hat{\beta}_1^i = \mathrm{constant}_1\ \forall i$,
where the parameters for each image are the same.
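
One way to see this, under the added assumption that the shared slope is positive (the text does not state its sign): the logistic function is strictly increasing, so with identical parameters for every image the ordering of the match probabilities is exactly the ordering of the similarities. Below, $x_i$ is shorthand introduced here for the similarity between the query and the reference description of image $i$, and $c_0$, $c_1$ stand for the two shared constants.

```latex
\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad
\frac{d\sigma}{dz} = \sigma(z)\,\bigl(1 - \sigma(z)\bigr) > 0
\\[4pt]
x_i > x_j
\;\Longleftrightarrow\;
c_0 + c_1 x_i > c_0 + c_1 x_j \quad (c_1 > 0)
\;\Longleftrightarrow\;
\sigma(c_0 + c_1 x_i) > \sigma(c_0 + c_1 x_j)
```

Sorting by the logistic output therefore reproduces the baseline ranking in that case, and any change in ranking has to come from parameters that differ across images.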

3.2.5 Summary 

Let’s say we are given a new database of images and associated
(single) reference descriptions that we want to search
using query sentences. SVRs are used to predict each of the
two LR parameters using image features for every image in
the database. This is done offline. When a query is issued,
its similarity is computed to each reference description in
the database. Each of these similarities is substituted into
Equation 7 to calculate the relevance of each image using
the LR parameters predicted for that image. This query-time
processing is computationally light. The images are
then sorted by the probability outputs of their LR models.
The quality of the retrieved results using this (proposed)
approach is compared to the baseline approach that sorts all
images based on the similarity between the query sentence
and reference descriptions. Of course, in the scenario where
multiple reference descriptions are available for each image
in the database, we can directly estimate the (ground-truth)
LR parameters using those descriptions (as described
in Section 3.2.3) instead of using the SVR to predict the LR
parameters. We will show results of both approaches (using
ground-truth LR parameters and predicted LR parameters).

4. Experimental Results 

4.1. Datasets and Image Features 

We experiment with three datasets. The first is the
MEM-5S dataset containing 888 images from the memorability
dataset [19], which are uniformly spaced in terms of
their memorability. For each of these images, we collected 5
sentence descriptions by asking unique subjects on AMT to
describe them. Figure 2 shows some example images and
their descriptions taken from the MEM-5S dataset. Since
specificity measures the variance between sentences, and
more sentences would result in a better specificity estimate,
we also experiment with two datasets with 50 sentences per
image in each dataset. One of these is the ABSTRACT-50S
dataset [46] which is a subset of 500 images made of clip
art objects from the Abstract Scene dataset [53] containing
50 sentences/image (48 training, 2 test). We use only
the training sentences from this dataset for our experiments.
The second is the PASCAL-50S dataset [46] containing 50
sentences/image for the 1000 images from the UIUC PASCAL
dataset [41]. These datasets allow us to study specificity
in a wide range of settings, from real images to
non-photorealistic but semantically rich abstract scenes. All
correlation analysis reported in the following sections was
performed using Spearman’s rank correlation coefficient.

For predicting specificity, we extract 4096D DECAF-6
[9] features from the PASCAL-50S images. Images in the
ABSTRACT-50S dataset are represented by the occurrence,
location, depth, flip angle of objects, object co-occurrences
and clip art category (451D) [53].

4.2. Consistency analysis 

In Section 3.1, we described two methods to measure
specificity. In the first, humans are involved in annotating
the similarity score between the sentences describing an image
and in the second, this is done automatically. We first
analyze if humans agree on their notions of specificity, and
then study how well human annotation of specificity correlates
with automatically-computed specificity.

Do humans rate sentence pair similarities consistently?

The similarity of each pair of sentences in the MEM-5S
dataset was rated by 3 subjects on AMT. This means that
every image is annotated by $\binom{5}{2} \times 3 = 30$ similarity ratings.
The average of the similarity ratings gives us the specificity
scores. These ratings were split thrice into two parts such
that the ratings from one subject were in one part and the
ratings from the other two subjects were in the other part. The
specificity score computed from the first part was correlated
with the specificity score of the other part. This gave an
average correlation coefficient of 0.72, indicating high
consistency in specificity measured across subjects.
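
The one-versus-rest split can be sketched as follows, using a small pure-Python Spearman correlation (Pearson correlation of average ranks). The data layout, one mean similarity per image and subject, is a simplification introduced for this example, and the numbers are made up.

```python
def average_ranks(values):
    """1-based ranks of the values, with ties given their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in order[start:end + 1]:
            ranks[k] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))


def split_consistency(per_subject):
    """Mean Spearman correlation over all one-subject-vs-rest splits.

    per_subject[i][m] is the specificity of image i computed from the
    ratings of subject m alone (assumed layout).
    """
    n_subjects = len(per_subject[0])
    correlations = []
    for held_out in range(n_subjects):
        part_one = [row[held_out] for row in per_subject]
        part_two = [
            sum(v for m, v in enumerate(row) if m != held_out) / (n_subjects - 1)
            for row in per_subject
        ]
        correlations.append(spearman(part_one, part_two))
    return sum(correlations) / len(correlations)


# Hypothetical per-subject specificity for three images and three subjects.
toy = [[0.1, 0.2, 0.15], [0.5, 0.4, 0.6], [0.9, 0.8, 0.85]]
print(split_consistency(toy))
```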

Is specificity consistent for a new set of descriptions?

Additionally, 5 more sentences were collected for a subset
of 222 images in the memorability dataset. With these
additional sentences, specificity was computed using human
annotations and the correlation with the specificity from the
previous set of sentences was found to be 0.54. Inter-human
agreement on the same set of 5 descriptions for 222 images
was 0.76. We see that specificity measured across two sets
of five descriptions each is not highly consistent. Hence,
we hypothesize that measuring specificity using more sentences
would be desirable (thus our use of the PASCAL-50S
and ABSTRACT-50S datasets)^.

How do human and automated specificity compare?

We find that the rank correlation between human-annotated
and automatically measured specificity (on the same set of
5 sentences for 888 images in MEM-5S) is 0.69, which is
very close to the inter-human correlation of 0.72. Note that
this automated method still requires textual descriptions by
humans. In a later section, we will consider the problem of
predicting specificity just from image features when textual
descriptions are not available. Note that in most realistic
applications (e.g. image search, which we explore later), it is
practical to measure specificity by comparing descriptions
automatically. Hence, the automated specificity measurement
may be the more relevant one.

Are some images more specific than others? Now that
we know specificity is well-defined, we study whether some
images are in fact more specific than others. Figure 2 shows
some examples of images whose specificity values range
from low to very high values. Note how the descriptions
become more varied as the specificity value drops. Figure 3
shows a histogram of specificity values on all three datasets.
In the MEM-5S dataset, the specificity values range from
0.11 to 0.93^. This indicates that indeed, some images are
specific and some images are ambiguous. We can exploit
this fact to improve applications such as text-based image
retrieval (Section 4.4).

^It was prohibitively expensive to measure human specificity for all 
pairs of 50 sentences to verify this hypothesis 

^Specificity values can fall between 0 and 1. 
[Figure 3: histograms of specificity values; recoverable panel titles: MEM-5S (human), MEM-5S (auto); x-axis: Specificity value]

Figure 3. A histogram of human-annotated specificity values for
the MEM-5S dataset (top left) and automated specificity values
for all three datasets (rest).

4.3. What makes an image specific? 

We now study what makes an image specific. The first
question we want to answer is whether longer sentence
descriptions lead to more variability and hence less specific
images. We correlated the average length of a sentence
(measured as the number of words in the sentence) with
specificity, and surprisingly, found that the length of a
sentence had no effect on specificity ($\rho$=-0.02, p-value=0.64).
However, we did find that the more specific an image was,
the less the length of the sentences describing it varied
($\rho$=-0.16, p-value<0.01).

Next, we looked at image content to unearth possibly
consistent patterns that make an image specific. We correlated
publicly available attribute, object and scene annotations
[18] for the MEM-5S dataset with our specificity
scores. We then sorted the annotations by their correlation
with specificity and showed the top 10 and bottom 10
correlations as a bar plot in Figure 4. We find that images with
people tend to be specific, while mundane images of generic
buildings or blue skies tend to not be specific. Note that if
a category (e.g. person) and its subcategory (e.g. Caucasian
person) both appeared in the top 10 or bottom 10 list and
had very similar correlations, the subcategory was excluded
in favour of the main category since the subcategory is
redundant.

Next, we hypothesized that images with larger objects 
in them may be more specific, since different people may 
all talk about those objects. Confirming this hypothesis, we 
found a correlation of 0.16 with median object area and 0.14 
with mean object area. 

Figure 4. Spearman’s rank correlation of human specificity with
attributes, objects and scene annotations for the MEM-5S dataset.

We then investigated how importance [3] relates to specificity.
Since important objects in images are those that tend
to be mentioned often, perhaps an image containing an
important object will be more specific because most people
will talk about the object. We consider all sentences
corresponding to all images containing a certain object
category. In each sentence, we identify the word (e.g.
vehicle) that best matches the category (e.g. car) using the
shortest-path similarity of the words taken from the WordNet
database [36]. We average the similarity between this
best matching word in each sentence to the category name
across all sentences of all images containing that category.
This is a proxy for how frequently that category is
mentioned in descriptions of images containing the category. A
similar average was found for randomly chosen sentences
from other categories as a proxy for how often a category
gets mentioned in sentences a priori. These two averages
were subtracted to obtain an importance value for each
category. Now, the specificity scores of all images containing an
object category were averaged and this score was found to be
significantly correlated ($\rho$=0.31, p-value=0.05) with the
importance score. This analysis was done only for categories
that were present in more than 10 images in the MEM-5S
dataset. This shows that images containing important
objects do tend to be more specific.

In another study, Isola et al. [19] measured image
memorability. Images were flashed in front of subjects who were
asked to press a button each time they saw the same image
again. Interestingly, repeats of some images are more
reliably detected across subjects than other images. That is,
some images are more memorable than others. We tested
if memorability and specificity are related by correlating
them and found a significant correlation ($\rho$=0.33, p-value<0.01)
between the two. Thus, specificity can explain memorability
to some extent. However, the two concepts are distinct.
For instance, peaceful, picture-perfect scenes that may
appear on a postcard or in a painting were found to be
negatively correlated with memorability [18] ($\rho_{peaceful}$=-0.35,
$\rho_{postcard}$=-0.31, $\rho_{painting}$=-0.32). But these attributes have
no correlation with specificity ($\rho_{peaceful}$=-0.05, $\rho_{painting}$=-0.02,
$\rho_{postcard}$=-0.05). In the supplementary material [21],
we include examples of images that are memorable but not
specific and vice-versa. Additional scatter plots for a subset
of the computed correlations are also included in the
supplementary material [21]. Finally, the correlations of the mean color of
the image with specificity were $\rho_{red}$=0.01, $\rho_{green}$=0.02 and
$\rho_{blue}$=0.01.

Figure 5. Image search results: Increasing the number of training
sentences per image improves the mean target rank obtained with
ground truth LR parameters (specificity). As expected, there is a
sharp improvement when the reference sentence (green fill) or both
the reference and query sentences (black fill) are included when
estimating the LR parameters. The results are averaged across 25
random repeats and the error intervals are shown in shaded colors.
Annotations indicate the number of sentences required to beat
baseline and the maximum improvement possible over baseline
using all available sentences. Lower mean rank of target means
better performance.

Overall, specificity is correlated with image content.
In fact, if we train a regressor to predict automated
specificity directly from DECAF-6 features in the
PASCAL-50S and MEM-5S datasets, we get correlations of
0.2 and 0.25, respectively. The correlation using semantic features in the
ABSTRACT-50S dataset was 0.35. More details are in the
supplementary material [21]. The reader is encouraged to browse
our datasets through the websites on the authors’ webpages.

4.4. Image search 

4.4.1 Ground truth Specificity 

Given a database of images with multiple descriptions each,
Section 3.2.3 describes how we estimate the parameters of a
Logistic Regression (LR) model and use the model for image
search. In our experiments, the query sentence corresponds
to a known target image (known for evaluation, not
to the search algorithm). The evaluation metric is the rank
of the target image, averaged across multiple queries.
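
As a rough illustration of this evaluation protocol, the sketch below ranks the database images for each query and reports the mean rank of the target. The logistic form sigmoid(b0 + b1 * sim), the function names and the toy numbers are assumptions made for illustration only; Section 3.2.3 defines the actual model.

```python
import math

def lr_score(sim, b0, b1):
    """Logistic score for one image given a query-reference similarity.
    The two-parameter form is an assumption of this sketch."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * sim)))

def target_rank(scores, target):
    """1-based rank of the target when images are sorted by descending score."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return order.index(target) + 1

def mean_target_rank(queries, params=None):
    """queries: list of (sims, target) pairs, where sims[i] is the similarity
    between the query and the reference sentence of image i.
    params: optional per-image (b0, b1); None gives the similarity-only baseline."""
    ranks = []
    for sims, target in queries:
        if params is None:
            scores = list(sims)
        else:
            scores = [lr_score(s, b0, b1) for s, (b0, b1) in zip(sims, params)]
        ranks.append(target_rank(scores, target))
    return sum(ranks) / len(ranks)

# Hypothetical toy data: 3 images, 2 queries.
queries = [([0.40, 0.55, 0.30], 0), ([0.20, 0.60, 0.50], 2)]
# Image 1 plays the role of a specific image: it needs a high similarity to score well.
params = [(-1.0, 6.0), (-4.0, 6.0), (-2.0, 6.0)]
baseline = mean_target_rank(queries)
modulated = mean_target_rank(queries, params)
```

In this toy setting the per-image parameters push the image that demands a close match below the two targets, so the mean target rank improves over sorting by raw similarity.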

We investigate the effect of number of training sentences 
per image used to train the LR model on the average rank 
of target images. Figure 5 shows that the mean rank of the
target image decreases with increasing number of training 
sentences. The baseline approach (Section 3.2.2) simply 
sorts the images by the similarity value between the query 



Dataset        Method    Mean rank   % of queries that meet or beat BL
PASCAL-50S     BL        50.80       -
               GT-Spec   44.70       67.3
               P-Spec    49.72       73.2
ABSTRACT-50S   BL        73.34       -
               GT-Spec   63.30       61.0
               P-Spec    69.41       61.6


Table 1. Image search results for different ranking algorithms:
baseline (BL), specificity using ground truth (GT-Spec) LR
parameters, and specificity (P-Spec) using predicted LR parameters.
The column with header Mean rank gives the rank of the target
image averaged across all images in the database. The final column
indicates the percentage of queries where the method does at
least as well as the baseline.

sentence and all reference sentences in the database (one per
image). For the PASCAL-50S dataset, 17 training sentences
per image were required to estimate an LR model that can
beat this baseline, while for the ABSTRACT-50S dataset, 8
training sentences per image were enough. The improvement
obtained over the baseline by training on all 50 sentences
was 1.0% of the total dataset size for PASCAL-50S
and 2.5% for the ABSTRACT-50S dataset.¹ With this improvement,
we bridge 20.5% of the gap between the baseline
and a perfect result (target rank of 1) for PASCAL-50S and
17.5% for ABSTRACT-50S.

4.4.2 Predicted Specificity 

We noted in the previous section that as many as 17 training
sentences per image are needed to estimate specificity accurately
enough to beat the baseline on the PASCAL-50S dataset.
In a real application, it is not feasible to expect this many
sentences per image. This leads us to explore whether it is
possible to predict specificity directly from images accurately
enough to beat the baseline approach.

As described in Section 3.2.4, we train regressors that 
map image features to the LR parameters. These regressors 
are then used to predict the LR parameters that are used for 
ranking the database of test images in an image search ap¬ 
plication. We do this with leave-one-out cross-validation 
(ensuring that none of the textual descriptions of the pre¬ 
dicted image were included in the training set) so that we 
have predicted LR parameters on our entire dataset. 
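
The leave-one-out protocol described above can be sketched as follows. The k-nearest-neighbour averager is a deliberately simple stand-in for the regressors of Section 3.2.4, and all names and numbers are hypothetical.

```python
def knn_predict(train_x, train_y, x, k=2):
    """Stand-in regressor (not the paper's): average the targets of the
    k training points closest to x in squared Euclidean distance."""
    def dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    nearest = sorted(range(len(train_x)), key=lambda i: dist(train_x[i], x))[:k]
    dims = len(train_y[0])
    return tuple(sum(train_y[i][d] for i in nearest) / len(nearest)
                 for d in range(dims))

def leave_one_out(features, lr_params, predict=knn_predict):
    """Predict the LR parameters of every image with a model that never saw
    that image (nor, therefore, any parameters estimated from its descriptions)."""
    predictions = []
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = lr_params[:i] + lr_params[i + 1:]
        predictions.append(predict(train_x, train_y, features[i]))
    return predictions

# Hypothetical toy data: one-dimensional image features, two LR parameters per image.
features = [[0.0], [1.0], [2.0], [3.0]]
lr_params = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0), (3.0, 4.0)]
predicted = leave_one_out(features, lr_params)
```

Because each image is held out in turn, every image in the dataset ends up with predicted parameters that can be used at search time.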

Table 1 shows how often our approach does better than 
or matches the baseline. It can be seen that specificity using 
the LR parameters predicted directly using image features 
does better than or matches the baseline for 73.2% of the 
queries. This is especially noteworthy since no sentence 
descriptions were used to estimate the specificity of the im¬ 
ages. Specificity is predicted using purely image features. 

¹Our approach is general and can be applied to different automatic text
similarity metrics. For instance, cosine similarity (dot product of the TF-
IDF vectors) also works quite well with a 3.5% improvement using ground-
truth specificity for ABSTRACT-50S.


Figure 6. Image search results: On the x-axis is plotted K, the 
margin in rank of target image by which baseline is beaten, and on 
the y-axis is the percentage of queries where baseline is beaten by 
at least K. 

From Table 1, we note that predicted specificity (P-Spec)
loses less often to the baseline than ground-truth
specificity (GT-Spec), but GT-Spec still has a better average
rank of target images than P-Spec. The reason
is that GT-Spec does much better than P-Spec on queries
where it wins against the baseline. We would therefore like
to know, when an approach beats the baseline, how often
it does so by a low or a high margin. Figure 6
shows the percentage of queries that beat the baseline by different
margins. The x-axis is the minimum margin by which
the baseline is beaten and the y-axis is the percentage of
queries. As expected, ground-truth specificity performs the
best. But even predicted specificity often beats the
baseline by large margins.
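
A minimal sketch of how a curve like the one in Figure 6 could be computed from per-query target ranks is given below. The function name and the toy ranks are hypothetical, and treating K = 0 as "meets or beats the baseline" is an assumption of this sketch.

```python
def percent_beating_baseline(baseline_ranks, method_ranks, margins):
    """For each margin K, the percentage of queries whose target rank
    improves on the baseline rank by at least K positions."""
    n = len(baseline_ranks)
    gains = [b - m for b, m in zip(baseline_ranks, method_ranks)]
    return {k: 100.0 * sum(g >= k for g in gains) / n for k in margins}

# Hypothetical target ranks for four queries (lower is better).
baseline_ranks = [10, 5, 8, 3]
method_ranks = [4, 5, 9, 1]
curve = percent_beating_baseline(baseline_ranks, method_ranks, margins=[0, 2, 4, 6])
```

A method that wins often but narrowly gives a curve that starts high and drops quickly, whereas a method that wins by large margins keeps a long tail, which is the distinction drawn between P-Spec and GT-Spec.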

Many approaches [6, 49] retrieve images based on text- 
based matches between the query and reference sentences, 
and then re-rank the results based on image content. This 
content-based re-ranking step can be performed on top of 
our retrieved results as well. Note that in our approach, the 
image features are used to modulate the similarity between 
the query and reference sentences - and not to better assess 
the match between the query sentence and image content. 
These are two orthogonal sources of information. 

Finally, Figure 7 shows a qualitative example from the 
ABSTRACT-50S dataset. In the first image, the query and
the reference sentence for the target image do not match 
very closely. However, since the image has a low automated 
specificity, this mediocre similarity is sufficient to lower the 
rank of the target image. 

5. Discussion and Conclusion 

We introduce the notion of specificity. We present evidence
that the variance in textual descriptions of an image,
which we call specificity, is a well-defined phenomenon.
We find that even abstract scenes which are not photorealistic
capture this variation in textual descriptions. We study


[Figure 7 panels. Left: Baseline: 386; GT-Spec: 195, P-Spec: 291; Specificity = 0.56; query: "Boy, wearing a viking hat, is angrily chasing girl, who is holding a cheeseburger" (sim_auto = 0.28). Right: Baseline: 5; GT-Spec: 6, P-Spec: 9; Specificity = 0.79; query: "Girl and boy play with a football at the park" (sim_auto = 0.76). Reference sentence recovered from an image: "Girl cries while she runs away from an angry boy."]

Figure 7. Qualitative image search results from the ABSTRACT-
50S dataset. The images are the target images. Query sentences
are shown in the blue box below each image (along with their
automated similarity to the reference sentence). The reference
sentence of the image is shown on the image. The automated
specificity value is indicated at the top-left corner of the images. Green
border (left) indicates that both predicted and ground-truth
specificity performed better than baseline, and red border (right)
indicates that baseline did better than both P-Spec and GT-Spec. The
rank of the target image using baseline, GT-Spec and P-Spec is
shown above the image.


various object and attribute-level properties that influence
specificity. More importantly, modeling specificity can benefit
various applications. We demonstrate this empirically
on a text-based image retrieval task. We use image features
to predict the parameters of a classifier (Logistic Regression)
that modulates the similarity between the query and
reference sentence differently for each image. Future work
involves exploring robust measures of specificity that consider
only representative sentences (not outliers); investigating
other similarity measures such as Lin’s similarity [33] or
word2vec² when measuring specificity; exploring the potential
of low-level saliency and objectness maps in predicting
specificity; studying specificity in more controlled settings
involving a closed set of visual concepts; and using image
specificity in various applications such as image tagging to
determine how many tags to associate with an image (few
for specific images and many for ambiguous images), image
captioning, etc. Our data and code are publicly available on
the authors’ webpages.


Acknowledgements: This work was supported in part by 
The Paul G. Allen Family Foundation Allen Distinguished 
Investigator award to D.P. 


²http://code.google.com/p/word2vec/





















Image Specificity 
(Supplementary material) 


Mainak Jas 
Aalto University 

mainak.jas@aalto.fi 


Devi Parikh 
Virginia Tech 

parikh@vt.edu 


6. Scatter plots for correlations 


[Figure 8 scatter plot. Spearman’s ρ = 0.69, p-value < 0.01; x-axis: measured specificity.]


Figure 8. Correlation between human-measured specificity and au¬ 
tomated specificity for the MEM-5S dataset. 

In the main paper, we described how automated specificity
correlated with human-measured specificity. Figure 8
further illustrates this using a scatter plot. We also studied
how various image properties correlated with specificity. In
Figure 9, we illustrate these correlations via scatter plots.

7. Predicting specificity 

As we have shown, certain image-level objects and at¬ 
tributes make some images more specific than others. This 
means that specificity may be predictable using image fea¬ 
tures alone. 

To test this, a ν-SVR with an RBF kernel is trained on
a randomly chosen subset of images represented by their 
DECAF-6 features [9] in the MEM-5S and PASCAL-50S 
datasets. In the ABSTRACT-50S dataset, the image fea¬ 
tures are a concatenation of object occurrence, their ab¬ 
solute position, depth, flip angle, object co-occurrence, 
and clip art category [53]. For prediction, 188 images 
are set aside in the MEM-5S dataset, 200 images in the 
PASCAL-50S dataset, and 100 images in the ABSTRACT- 


[Figure 9 scatter plots. Panel statistics: ρ = 0.33, p-value < 0.01; ρ = 0.31, p-value = 0.05; ρ = 0.16, p-value < 0.01; ρ = -0.04, p-value = 0.19.]



Figure 9. What makes an image specific? Memorable images, im¬ 
ages with large objects and important object categories tend to be 
more specific. Number of annotated objects in an image does not 
correlate with specificity. Results are on the MEM-5S dataset. 


50S dataset. Figure 10 shows that as the number of images
used for training increases, the correlation of the predicted
specificity with the ground truth automated specificity increases.
We see that specificity can indeed be predicted
from just image content better than chance. The use of
semantic features (e.g. occurrence of objects) as opposed
to low-level features (e.g. DECAF-6) in the ABSTRACT-50S
dataset seems to make it easier to predict specificity for
that dataset as compared to the MEM-5S and PASCAL-50S
datasets. Note that here we are directly predicting automated
specificity whereas in the main paper, we focused
on predicting the two parameters of the Logistic Regression
model. The latter is directly relevant to the image search



Figure 10. Spearman’s rank correlation between predicted and au¬ 
tomated specificity for increasing number of training images (av¬ 
eraged across 50 random runs). Automated specificity (Section 
3.1.2 in main paper) uses 5, 48 and 50 sentences per image for 
the three datasets, MEM-5S, ABSTRACT-50S and PASCAL-50S 
to estimate the specificity of the image. Predicted specificity (Sec¬ 
tion 7) uses only image features to predict the specificity. Different 
datasets have different number of images in them, hence they stop 
at different points on the x-axis. Higher correlation is better. The 
error bars represented by shaded colors show the standard error of 
the mean (SEM). 


application on which we demonstrated the benefit of speci¬ 
ficity. 
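
The quantity reported in Figure 10 is Spearman's rank correlation between predicted and automated specificity. As a hedged illustration of that metric only (not of the regression itself), the sketch below computes it with the standard library; the toy scores are hypothetical.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical scores for four held-out images.
automated = [0.2, 0.4, 0.6, 0.8]
predicted = [0.30, 0.35, 0.50, 0.45]
rho = spearman(automated, predicted)
```

Because only the ordering matters, a predictor can reach a high ρ even when its absolute specificity values are poorly calibrated.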

8. Detailed explanation of automated specificity computation

In Figure 11, we visually illustrate the equations and
notations used to automatically compute the similarity between
two sentences (described in Section 3.1.2 in the main
paper). To measure specificity automatically given the N
descriptions for image i, we first tokenize the sentences and
only retain words of length three or more. This ensures that
semantically irrelevant words, such as ‘a’, ‘of’, etc., are
not taken into account in the similarity computation (a standard
stop word list could also be used instead). We identify
the synsets (sets of synonyms that share a common
meaning) to which each (tokenized) word belongs using the
Natural Language Toolkit [4]. Words with multiple meanings
can belong to more than one synset. Let $Y_{au} = \{y_{au}\}$
be the set of synsets associated with the u-th word from
sentence $s_a$.
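
The length-based filtering step can be sketched as below. Lowercasing and splitting on non-letter characters are assumptions of this sketch, and the synset lookup through the Natural Language Toolkit is not shown.

```python
import re

def tokenize(sentence, min_length=3):
    """Lowercase the sentence, split it into alphabetic words, and keep only
    words with at least min_length characters."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return [w for w in words if len(w) >= min_length]

tokens = tokenize("A view of the ocean at sunset")
```

Note that a three-letter function word such as "the" survives this filter, which is one reason a standard stop word list could be used instead.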

Every word in both sentences contributes to the automatically
computed similarity $sim_{auto}(s_a, s_b)$ between a pair
of sentences $s_a$ and $s_b$. The contribution of the u-th word
from sentence $s_a$ to the similarity is $c_{au}$. This contribution
is computed as the maximum similarity between this word


[Figure 11 panels.
A. The similarity between two sentences is a weighted average of the contributions of each word with the TF-IDF scores.
B. The contribution is computed as the maximum similarity between a word and all words in the other sentence.
C. Similarity between two words is the maximum similarity between all pairs of synsets they belong to.
Legend: $Y_{au}$ / $Y_{bv}$: set of synsets; $s_a$ / $s_b$: sentences; $c_{au}$ / $c_{bv}$: contribution by one word; $t_{au}$ / $t_{bv}$: TF-IDF score of one word.]

Figure 11. Illustration of our approach to compute automated sentence
similarity.


and all words in sentence $s_b$ (indexed by v) (Figure 11B).
The similarity between two words is the maximum similarity
between all pairs of synsets (or senses) to which the
two words have been assigned (Figure 11C). We take the
maximum because a word is usually used in only one of its
senses. Concretely,

$$c_{au} = \max_{v} \max_{y_{au} \in Y_{au}} \max_{y_{bv} \in Y_{bv}} sim_{sense}(y_{au}, y_{bv}) \qquad (8)$$

The similarity between senses $sim_{sense}(y_{au}, y_{bv})$ is the
shortest path similarity between the two senses on WordNet [36].
We can similarly define $c_{bv}$ to be the contribution
of the v-th word from sentence $s_b$ to the similarity
$sim_{auto}(s_a, s_b)$ between sentences $s_a$ and $s_b$.

The similarity between the two sentences is defined as 
the average contribution of all words in both sentences, 
weighted by the importance of each word (Figure 11A). Let
[Figure 12 panels: example images, each annotated with a specificity score, a memorability score (where recovered) and five sentence descriptions. The recovered specificity scores range from 0.11 to 0.89 and the recovered memorability scores from 0.3 to 0.9.]


Figure 12. Examples illustrating the similarity and distinctions between image memorability [19] and image specificity. 


the importance of the u-th word from sentence $s_a$ be $t_{au}$.
This importance is computed using term frequency-inverse
document frequency (TF-IDF) using the scikit-learn software
package [40]. Words that are rare in the corpus but
occur frequently in a sentence contribute more to the similarity
of that sentence with other sentences. So we have

$$sim_{auto}(s_a, s_b) = \frac{\sum_u t_{au} c_{au} + \sum_v t_{bv} c_{bv}}{\sum_u t_{au} + \sum_v t_{bv}} \qquad (9)$$

The denominator in Equation 9 ensures that the similarity
between two sentences is independent of sentence length
and is always between 0 and 1.
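
Equations 8 and 9 can be sketched together as follows. Each sentence is represented as a list of words and each word as a list of sense identifiers. The sense identifiers, the TF-IDF weights and the toy sense similarity are hypothetical stand-ins for the WordNet shortest path similarity and the scikit-learn TF-IDF scores used in the paper.

```python
def contribution(word_senses, other_sentence, sim_sense):
    """Equation 8: best sense-level match between one word and any word
    of the other sentence."""
    return max(
        (sim_sense(y_a, y_b)
         for senses_b in other_sentence
         for y_a in word_senses
         for y_b in senses_b),
        default=0.0,
    )

def sim_auto(sent_a, sent_b, weights_a, weights_b, sim_sense):
    """Equation 9: weighted average of the word contributions of both sentences."""
    c_a = [contribution(w, sent_b, sim_sense) for w in sent_a]
    c_b = [contribution(w, sent_a, sim_sense) for w in sent_b]
    numerator = (sum(t * c for t, c in zip(weights_a, c_a))
                 + sum(t * c for t, c in zip(weights_b, c_b)))
    denominator = sum(weights_a) + sum(weights_b)
    return numerator / denominator if denominator else 0.0

# Toy sense similarity (hypothetical): 1.0 for identical senses,
# 0.5 for one hand-picked related pair, 0.0 otherwise.
RELATED = {frozenset(("ocean.n.01", "sea.n.01"))}

def toy_sim_sense(y_a, y_b):
    if y_a == y_b:
        return 1.0
    return 0.5 if frozenset((y_a, y_b)) in RELATED else 0.0

sent_a = [["ocean.n.01"], ["sunset.n.01"]]
sent_b = [["sea.n.01"], ["sunset.n.01", "sunset.n.02"]]
score = sim_auto(sent_a, sent_b, [1.0, 2.0], [1.0, 2.0], toy_sim_sense)
```

With sense similarities in [0, 1] and non-negative weights, every contribution lies in [0, 1], so the weighted average does too, which is the role of the denominator in Equation 9.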

9. Specificity vs. Memorability 




In our paper, we have shown that specificity and memorability
are correlated. However, they are distinct concepts
and measure different properties of the image. In particu- 

















[Figure 13 screenshot: dataset browser with a navigation bar (Image Specificity, MEM-5S, PASCAL-50S, ABSTRACT-50S, supplementary pdf), a search box (example query "dog woman", 2/888 images matching the criteria), sliders for filtering (e.g. Specificity), and the sentence descriptions shown next to each image.]


Figure 13. Dataset browser for exploring the datasets. Available on the authors’ webpages. 


lar, we have shown that peaceful and picture-perfect scenes 
are negatively correlated with memorability but have no ef¬ 
fect on specificity. In Figure 12, we show examples of im¬ 
ages that are specific/not specific and memorable/not mem¬ 
orable. Note how outdoor scenes tend to be not very mem¬ 
orable but can have a reasonably high specificity score. 

10. Website for exploring datasets 

Here, we describe the website interface available on the 
authors’ webpages that can be used to explore the datasets 
used in the paper. A navigation bar on top of the website 
allows users to switch between different datasets. Figure 13 
shows how the search function can be used to look for sen¬ 
tences containing the words “dog” and “woman”. Up to 
a maximum of 6 words can be added in the search box. 
Only whole words are matched. The reader should note that
the website does not implement the text-based search algorithms
discussed in the paper. It is meant only for browsing
the datasets. Sliders on the left allow the user to filter
ages according to a range of scores that the images satisfy. 
All the criteria are combined using logical AND to display 
the filtered images. The number of images matching the 
search criteria gives the user an idea of how often two or 
more criteria are satisfied concurrently. The benefit of us¬ 
ing such a website is that it can give the readers an intuition 
of the underlying data and factors that affect specificity. We 
have added sliders for the attributes that correlate most (top 
10) and least (bottom 10) with specificity (for the MEM-5S 
dataset). It is also possible to filter by average length of the 
sentences and the memorability score. 

Glossary 

automated specificity: Specificity computed from image textual descriptions by averaging automatically computed sentence similarities (Section 3.1.2).

ground-truth LR/specificity: Specificity computed using Logistic Regression parameters estimated from image textual descriptions (Section 3.2.3).

human specificity: Specificity measured from image textual descriptions by averaging human-annotated sentence similarities (Section 3.1.1).

predicted LR/specificity: Specificity computed using Logistic Regression parameters predicted from image features (Section 3.2.4).

predicted specificity: Specificity computed from image features without any textual descriptions.